Cell
A combination of attributes under a given set of quasi-identifiers. E.g. given the set of quasi-identifiers \(\left\{Age,\ Marital\ Status,\ Occupation\right\}\), cells include \(\left(20-24\ years,\ married,\ farmer\right)\) and \(\left(25-29\ years,\ never\ married,\ nurse\right)\).
Cell count
The number of units belonging to a given cell, either in the population or in a microdata set. The former is the population cell count and the latter is the sample cell count (assuming the microdata set only contains a sample of population units). For example, if there are 2,600 married farmers aged 20-24 years in the population but only 130 of them are included in the sample that makes up the microdata set, then the population cell count for the cell \(\left(20-24\ years,\ married,\ farmer\right)\) is 2,600 while the sample cell count is 130.
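As an illustration only, sample cell counts can be tabulated by counting the records that share each combination of quasi-identifier values. The sketch below assumes a tiny made-up microdata set; the field names and the `cell_counts` helper are hypothetical and are not taken from any ABS system.

```python
from collections import Counter

# Hypothetical microdata records; field names and values are illustrative assumptions.
records = [
    {"age": "20-24 years", "marital_status": "married", "occupation": "farmer"},
    {"age": "20-24 years", "marital_status": "married", "occupation": "farmer"},
    {"age": "25-29 years", "marital_status": "never married", "occupation": "nurse"},
]


def cell_counts(rows, quasi_identifiers):
    """Count how many records fall into each cell.

    A cell is represented as a tuple of quasi-identifier values,
    in the order given by `quasi_identifiers`.
    """
    return Counter(tuple(row[q] for q in quasi_identifiers) for row in rows)


counts = cell_counts(records, ["age", "marital_status", "occupation"])
for cell, count in counts.items():
    print(cell, count)
```

Applied to the full population rather than the sample, the same tabulation would give population cell counts.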
Confidentiality
The protection of the privacy and secrecy of information collected from individuals and organisations. For the ABS, this means ensuring no data is released in a manner likely to enable the identification of those individuals or organisations. The form of protection could include alterations of the data, controls placed on the user, controls placed on the access environment, controls placed on the project for which the data is being used, and/or controls placed on the outputs from the project.
Cross-sectional microdata
Microdata collected at and pertaining to a point in time. Each population unit corresponds to at most one record in the cross-sectional microdata set.
De-identified
Personal information is de-identified if the information is no longer about an identifiable individual or an individual who is reasonably identifiable. This is different to ‘unidentified’. Personal information is unidentified when direct identifiers are removed or altered into an unidentifiable form. Unidentified data often requires further controls to be considered de-identified, such as controls from the Five Safes.
Direct identifiers
Variables that unambiguously identify individuals in a microdata set. E.g. name, address, tax file number.
Disclosure
The identification of a person or organisation in a supposedly de-identified dataset, or the attribution of information in the data to them. The former is called re-identification and the latter is called attribute disclosure. Disclosure risk is the probability that disclosure occurs.
Intruder
A person who deliberately attempts to breach protection measures that have been applied to some data, with the aim of gaining information about a person or organisation to which the data relates.
Microdata set
A dataset in which each row is a record belonging to a population unit (usually an individual or an organisation) and each column is a variable that contains information about an attribute of the population units. Multiple records may belong to the same population unit. Not all population units are necessarily included in the dataset (i.e. the dataset may contain data for only a sample of population units).
Personal information
Information or an opinion about an identified individual, or an individual who is reasonably identifiable, whether the information or opinion is true or not, and whether the information or opinion is recorded in material form or not.
Population
The set of real-world units from which a dataset is drawn.
Population unique
A population unit that has a unique combination of attributes in the population under a given set of quasi-identifiers.
Quasi-identifiers
Variables in a microdata set that alone do not lead to re-identification of an individual, but when considered together may allow re-identification. These variables are often considered public information that could be known by an external entity. E.g. age, marital status, occupation.
Re-identification
The identification of a person or organisation in a supposedly de-identified dataset. Re-identification risk is the probability that this occurs.
Sample
A subset of units from a population.
Sample unique
A population unit that has a unique combination of attributes in a given sample under a given set of quasi-identifiers.
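A minimal sketch of how sample uniques might be found in practice: tabulate the cells, then keep the records whose cell occurs exactly once. The data and the `sample_uniques` helper are hypothetical, and the sketch assumes cross-sectional microdata (one record per unit).

```python
from collections import Counter

# Hypothetical sample; field names and values are illustrative assumptions.
sample = [
    {"id": 1, "age": "20-24 years", "marital_status": "married", "occupation": "farmer"},
    {"id": 2, "age": "20-24 years", "marital_status": "married", "occupation": "farmer"},
    {"id": 3, "age": "25-29 years", "marital_status": "never married", "occupation": "nurse"},
]


def sample_uniques(rows, quasi_identifiers):
    """Return the records whose quasi-identifier combination occurs exactly once."""
    cells = [tuple(row[q] for q in quasi_identifiers) for row in rows]
    counts = Counter(cells)
    return [row for row, cell in zip(rows, cells) if counts[cell] == 1]


uniques = sample_uniques(sample, ["age", "marital_status", "occupation"])
print([row["id"] for row in uniques])
```

A sample unique is not necessarily a population unique: other units sharing the same cell may exist in the population without having been sampled.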
Statistical disclosure control (SDC)
Statistical methods that alter the data by changing or suppressing some values in the data or in the outputs produced from the data, for the purpose of maintaining data confidentiality. Units with high disclosure risk may be treated by having their values rounded, perturbed, top/bottom-coded, suppressed, swapped with another unit, or the unit may be removed, among other possible treatments.
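Three of the treatments listed above (rounding, top-coding and suppression) can be sketched as simple functions. This is an illustration under assumed parameters: the rounding base, the cap and the minimum cell count below are arbitrary choices for the example, not ABS rules, and the function names are hypothetical.

```python
def round_to_base(value, base=5):
    """Round a non-negative integer to the nearest multiple of `base` (ties round up)."""
    return (value + base // 2) // base * base


def top_code(value, cap):
    """Replace any value above `cap` with `cap` itself."""
    return min(value, cap)


def suppress_small(count, minimum=3):
    """Return None (suppressed) for counts below `minimum`, else the count unchanged."""
    return None if count < minimum else count


print(round_to_base(13))          # rounded count
print(top_code(250000, 150000))   # top-coded income
print(suppress_small(2))          # suppressed small cell
```

Each treatment trades some utility for lower disclosure risk, so the parameters would in practice be chosen with both in mind.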
Unit
An entity that could be an individual, a household, an organisation, etc. For cross-sectional microdata, each unit corresponds to at most one record in the microdata set, so ‘unit’ and ‘record’ can be used interchangeably.
Utility
The value of a dataset for analytical and research purposes, referring to both the completeness of the dataset and the accuracy of the values within.